ABSTRACT

Most previous work on the recently developed language- 
modeling approach to information retrieval focuses on docu- 
ment-specific characteristics, and therefore does not take 
into account the structure of the surrounding corpus. We 
propose a novel algorithmic framework in which information 
provided by document-based language models is enhanced 
by the incorporation of information drawn from clusters of 
similar documents. Using this framework, we develop a suite 
of new algorithms. Even the simplest typically outperforms 
the standard language-modeling approach in precision and 
recall, and our new interpolation algorithm posts statisti- 
cally significant improvements for both metrics over all three 
corpora tested. 

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: Language
models, clustering, smoothing 

General Terms 

Algorithms, Experiments 

Keywords 

language modeling, aspect models, interpolation model, clus- 
tering, smoothing, cluster-based language models 

1. INTRODUCTION 

As is well known, a basic problem in information retrieval 
is to determine how relevant a particular document is to a 
query. In the automatic ad hoc retrieval setting, examples of 
relevant documents are not supplied. Given this absence of 
explicit relevance evidence, it is important to consider what 
other information sources can be exploited. 

In methods patterned after the classic tf.idf document- 
vector approach to text representation, the focus is mostly 
on utilizing within-document features, such as term frequencies.
Information drawn from the corpus as a whole generally
consists of aggregates of statistics gathered from each
document considered in isolation; for example, the inverse 
document frequency is based on checking, for each docu- 
ment, whether that document contains a particular term. 

Recent work has demonstrated the effectiveness of an al- 
ternative approach wherein probabilistic models of text gen- 
eration are constructed from documents, and these induced 
language models (LMs) are used to perform document rank- 
ing |15ll5j. Like tf.idf and related techniques, though, lang- 
uage-modeling methods typically use only individual-docu- 
ment features and corpus- wide aggregates of the same. (Cor- 
pus term counts are generally employed for smoothing, so 
that unseen text can be assigned non-zero probability.) 

Neither of the aforementioned approaches typically makes 
use of a potentially very powerful source of information: the 
similarity structure of the corpus. Clusters are a convenient 
representation of similarity whose potential for improving 
retrieval performance has long been recognized |Hll(ij. From 
our point of view, one key advantage is that they provide 
smoothed, more representative statistics for their elements, 
as has been recognized in statistical natural language
processing for some time. For example, we could infer that
a document not containing a certain query term is still rele- 
vant if the document belongs to a cluster whose component 
documents generally do contain the term. 

However, relying on clusters alone has some potential draw- 
backs. Clustering at retrieval time can be very expensive, 
but off-line clustering seems, by definition, query-independent 
and therefore may be based on factors that are irrelevant to 
user information need. Also, cluster statistics may over- 
generalize with respect to specific member documents. 

We therefore propose a framework for incorporating both 
corpus-structure information — using pre-computed, over- 
lapping clusters — and individual-document information. 
Importantly, although cluster formation is query-independent, 
within our framework the choice of which clusters to incor- 
porate can depend on the query. We then consider several 
of the many possible algorithms arising as specific instan- 
tiations of our framework. These include both novel meth- 
ods and, as special cases, both the standard, non-cluster- 
based LM approach and a variant of the cluster-based aspect 
model 0. 

Our empirical evaluation consists of experiments in an ar- 
ray of settings created by varying several parameters and 
meta-parameters; these include the corpus, the information
representation (e.g., language models versus tf.idf-style
vectors), and, when applicable, the smoothing method selected.
We find that even the worst-performing of our novel algo- 
rithms is competitive with the LM approach, and indeed al- 
ways provides substantial improvement in recall. In general, 
our algorithms provide good performance in comparison to a 
number of recently proposed methods, thus demonstrating 
that our integration approach to incorporating document 
and corpus-structure information is an effective way to im- 
prove ad hoc retrieval. 

Notational conventions. We use d, q, c and C to denote a
document, query, cluster, and corpus, respectively. A fixed
vocabulary is assumed. We use the notation p_d(·) for the
language model (which assigns probabilities to text strings
over the fixed vocabulary) induced from d by some
pre-specified method, and p_c(·) for the language model induced
from c. (Section 5 describes the induction methods we used
in our experiments.)

It is convenient to use Kronecker delta notation δ[s] to set
up some definitions. The argument s is a statement; δ[s] = 1
if s holds, and 0 otherwise.

2. RETRIEVAL FRAMEWORK 

As noted above, when we rank documents with respect 
to a query, we desire per-document scores that rely both on 
information drawn from the particular document's contents 
and on how the document is situated within the similarity 
structure of the ambient corpus. 

Structure representation via overlapping clusters. Document
clusters are an attractive choice for representing corpus
similarity structure (see [16, chapter 3] for extended
discussion). Clusters can be thought of as facets of the corpus
that users might be interested in. Given that a particular
document can be relevant to a user for several reasons, or
to different users for different reasons, we believe that a set
of overlapping clusters¹ forms a better model for similarity
structure than a partitioning of the corpus. Furthermore,
employing intersecting clusters may reduce information loss
due to the generalization that clustering can introduce
[16, pg. 44].

Information representation. Motivated by the empirical 
successes of language- modeling-based approaches |15ll5). we 
use language models induced from documents and clusters as 
our information representation. Thus, p_d(q) and p_c(q)
specify our initial knowledge of the relation between the query
q and a particular document d or cluster c, respectively.
(However, Section 6 shows that using a tf.idf representation
also yields performance improvements with respect to
the appropriate baseline, though not to the same degree as
using language models does.)

Information integration. To assign a ranking to the documents
in a corpus C with respect to q, we want to score each
d ∈ C against q in a way that incorporates information from
query-relevant corpus facets to which d belongs. While one
could compute clusters specific to q at retrieval time,
efficiency considerations compel us to create Clusters(C), the
set of clusters, in advance, and hence in a query-independent
fashion. To compensate, at retrieval time we base the choice
of appropriate facets on the query.

¹ We include soft or probabilistic clusters in this category.

How might cluster information be used? Our discussion 
above indicates that clusters can serve two roles. Insofar 
as they approximate true facets of the corpus, they can aid 
in the selection of relevant documents: we would want to 
retrieve those that belong to clusters corresponding to facets 
of interest to the user. On the other hand, clusters also 
have the capacity to smooth individual-document language 
models, since they pool statistics from multiple documents. 
Finally, we must remember that over-reliance on p_c(q) can
over-generalize by failing to account for document-specific
information encoded in p_d(q).

These observations motivate the algorithm template shown
in Figure 1. This template is fairly general: both the standard
language-modeling approach [15] and the aspect model
[2] are concrete instantiations. In the template, the choice
of Facets_q(d) corresponds to utilizing clusters in their
selection role. The scoring step can be thought of as integrating
p_d(q) with cluster-based language models in their smoothing
role. The optional re-ranking step is used as a way to
further bias the final ranking towards document-specific
information, if desired. Note that re-ranking can change the
average non-interpolated precision but not the absolute
precision or recall of the retrieval results; we therefore use it,
when necessary, to enhance average precision. (Section 6
reports experiments studying its efficacy.)



Figure 1: Algorithm template. 

In the next section, we describe a number of specific al- 
gorithms arising from this template, concentrating on their 
degree of dependence on cluster-induced language models. 
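
The template of Figure 1 can be read as a small higher-order function. The following is a minimal Python sketch of that reading, not the implementation used in the experiments: the function name `retrieve`, its callable parameters, and the rule that zero-scoring documents are dropped are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def retrieve(
    docs: Sequence[str],
    n: int,
    choose_facets: Callable[[str], list],
    score: Callable[[str, list], float],
    doc_score: Optional[Callable[[str], float]] = None,
) -> List[str]:
    """Template sketch: choose facets, score, keep the top n, optionally re-rank."""
    scores = {d: score(d, choose_facets(d)) for d in docs}
    # Assumption: documents with a zero score are never returned.
    candidates = [d for d in docs if scores[d] > 0]
    top = sorted(candidates, key=lambda d: scores[d], reverse=True)[:n]
    if doc_score is not None:
        # Re-ranking reorders the top-n list but cannot change its membership.
        top = sorted(top, key=doc_score, reverse=True)
    return top


# Toy usage (hypothetical numbers): score is p_d(q) times the number of facets.
p_d_q = {"d1": 0.2, "d2": 0.5, "d3": 0.1}
facets_of = {"d1": ["c1"], "d2": ["c1", "c2"], "d3": []}
ranking = retrieve(list(p_d_q), 2, facets_of.get, lambda d, f: p_d_q[d] * len(f))
```

Because the optional re-ranking only permutes the already selected top-N list, it can affect average precision but not recall, which matches the remark above.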

3. RETRIEVAL ALGORITHMS 

Table 1 summarizes the algorithms we consider, which
represent a few choices out of the many possible ways to
instantiate the template of Figure 1. Our preference in picking
these algorithms has been towards simpler methods, so as to
focus on the impact of using cluster information (as opposed
to the impact of tuning many weighting parameters).

Algorithm template (Figure 1):

    Offline: Create Clusters(C)

    Given q and N, the number of documents to retrieve:
      For each d ∈ C,
        Choose a cluster subset Facets_q(d) ⊆ Clusters(C)
        Score d by a weighted combination of p_d(q) and
          the p_c(q)'s for all c ∈ Facets_q(d)
      Set TopDocs(N) to the rank-ordered list of N top-scoring
        documents
      Optional: re-rank d ∈ TopDocs(N) by p_d(q)
      Return TopDocs(N)

Algorithm         Facets_q(d)                      Score                                                      Re-rank by p_d(q)?
LM                N/A                              p_d(q)                                                     (redundant)
basis-select      {Cohort(d)} ∩ TopClusters_q(m)   p_d(q) · δ[|Facets_q(d)| > 0]                              (redundant)
set-select        {c : d ∈ c} ∩ TopClusters_q(m)   p_d(q) · δ[|Facets_q(d)| > 0]                              (redundant)
bag-select        {c : d ∈ c} ∩ TopClusters_q(m)   p_d(q) · |Facets_q(d)|                                     yes
uniform-aspect-x  {c : d ∈ c} ∩ TopClusters_q(m)   Σ_{c ∈ Facets_q(d)} p_c(q)                                 yes
aspect-x          {c : d ∈ c} ∩ TopClusters_q(m)   Σ_{c ∈ Facets_q(d)} p_c(q) · p_c(d)                        yes
interpolation     {c : d ∈ c} ∩ TopClusters_q(m)   λ · p_d(q) + (1 − λ) Σ_{c ∈ Facets_q(d)} p_c(q) · p_c(d)   no

Table 1: Algorithm specifications.

First step: Cluster formation and selection. There are
many algorithms that can be used to create Clusters(C), the
set of overlapping document clusters required by Figure 1's
template. In our experiments, we simply have each document
d form the basis of a cluster Cohort(d) consisting of d
and its k − 1 nearest neighbors, where k is a free parameter.
(Note that two clusters with different basis documents may
contain the same set of documents.) Inter-document distance
is measured by the Kullback-Leibler (KL) divergence
between the corresponding (smoothed) language models, as 
in El. 

The idea behind our use of cohorts is that a document's 
nearest neighbors in similarity space represent a local "frag- 
ment" or "tile" of the overall similarity structure of the cor- 
pus. Our evaluation results show that even this relatively 
unsophisticated way to approximate facets enables effective 
leveraging of corpus structure; at the very least, it serves as 
a form of nearest-neighbor smoothing (see below).
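
As a rough illustration of cohort formation, the sketch below builds Cohort(d) from a dictionary of already smoothed unigram models. The names `kl_divergence` and `cohort`, the dictionary representation, and the direction of the divergence (from the basis model to the candidate model) are assumptions for illustration; the text above does not fix these details.

```python
import math
from typing import Dict, List

LM = Dict[str, float]  # word -> probability; assumed smoothed, so no zero entries


def kl_divergence(p: LM, q: LM) -> float:
    """D(p || q), summed over the words to which p assigns non-zero mass."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def cohort(basis: str, models: Dict[str, LM], k: int) -> List[str]:
    """Return the basis document followed by its k - 1 nearest neighbors."""
    others = [d for d in models if d != basis]
    # Assumed direction: divergence of each candidate's model from the basis model.
    others.sort(key=lambda d: kl_divergence(models[basis], models[d]))
    return [basis] + others[: k - 1]


# Toy models over a two-word vocabulary (hypothetical numbers).
models = {
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.6, "y": 0.4},
    "c": {"x": 0.1, "y": 0.9},
}
cohort_of_a = cohort("a", models, 2)
```

Every document is the basis of exactly one cohort, so the resulting clusters overlap by construction.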

The first retrieval-time action specified by our algorithm
template is to choose Facets_q(d), a query-dependent subset
of Clusters(C). In all the algorithms described below except
the baseline (which doesn't use cluster information),
there is a document-selection aspect to this subset, in that
only documents in some c ∈ Facets_q(d) can appear in the
final ranked-list output. Ideally, we would use the clusters
best approximating those (true) facets of the corpus that
are most representative of the user's interests, as expressed
by q; therefore, we require that Facets_q(d) be a subset of
TopClusters_q(m), the top m clusters c with respect to p_c(q).
But we also want to evaluate d only with respect to the
facets it actually exhibits. Thus, in what follows (except for
the baseline), Facets_q(d) is always defined to be a subset of
{c : d ∈ c} ∩ TopClusters_q(m); we assume m is large enough
to produce the desired number of retrieved documents N.
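
The two-stage choice just described (first the top m clusters for the query, then the ones that actually contain d) can be sketched as follows. The function names and the dictionary inputs are hypothetical, and the cluster scores p_c(q) are assumed to have been computed already.

```python
from typing import Dict, FrozenSet, List


def top_clusters(p_c_q: Dict[str, float], m: int) -> List[str]:
    """TopClusters_q(m): ids of the m clusters with the highest p_c(q)."""
    return sorted(p_c_q, key=lambda c: p_c_q[c], reverse=True)[:m]


def facets(d: str, members: Dict[str, FrozenSet[str]], top: List[str]) -> List[str]:
    """{c : d in c} intersected with TopClusters_q(m), in rank order."""
    return [c for c in top if d in members[c]]


# Toy data (hypothetical numbers).
p_c_q = {"c1": 0.3, "c2": 0.1, "c3": 0.2}
members = {
    "c1": frozenset({"d1", "d2"}),
    "c2": frozenset({"d1"}),
    "c3": frozenset({"d2", "d3"}),
}
top = top_clusters(p_c_q, 2)
```

A document that belongs to no top cluster gets an empty facet set and therefore cannot appear in the output of any cluster-based algorithm.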

Baseline method. The baseline for our experiments, denoted
LM, is to simply rank documents by p_d(q); no
cluster information is used. Details of our particular
implementation are given in Section 5.

Selection methods. In this class of algorithms, the
cluster-induced language models play a very small role once the set
Facets_q(d) is selected. In essence, the standard
language-modeling approach (that is, ranking by p_d(q)) is invoked
to rank (some of) the documents comprising the clusters in
Facets_q(d). This method of scoring is intended to serve as
a precision-enhancing mechanism, downgrading documents
that happen to be members of some c ∈ Facets_q(d) by dint
of similarity to d in respects not pertinent to q.

In the basis-select algorithm, the net effect of the
definition given in Table 1 is that only the basis documents
of the clusters in TopClusters_q(m) are allowed to appear in
the final output list. Thus, this algorithm uses the pooling
of statistics from documents in Cohort(d) simply to decide
whether d is worth ranking; the rank itself is based solely
on p_d(q).

The set-select algorithm differs in that all the documents
in the clusters in TopClusters_q(m) may appear in the final
output list; the "set" referred to in the name is the union
of the clusters in TopClusters_q(m). The idea is that any
document in a "best" cluster, basis or not, is potentially
relevant and should be ranked. Again, the ranking of the
selected documents is by p_d(q).²

Another natural variant of the same idea is that documents
appearing in more than one cluster in TopClusters_q(m)
should get extra consideration, given that they appear in
several (approximations of) facets thought to be of interest
to the user. This idea gives rise to the bag-select algorithm,
so named in reference to the incorporation of a document's
multiplicity in the bag formed from the "multi-set union"
of all the clusters in Facets_q(d). First, each selected
document d is assigned a score consisting of the product of its
language-modeling score p_d(q) and the number of "top"
clusters it belongs to. The N top-scoring documents are then
re-ranked via p_d(q) and presented in the new sorted order.
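
To make the difference between set-select and bag-select concrete, here is a small sketch under the assumption that p_d(q) and each document's number of top clusters are given as dictionaries. The function names and inputs are hypothetical, and the "fifo" detail of the actual set-select implementation is ignored.

```python
from typing import Dict, List


def set_select(p_d_q: Dict[str, float], n_top: Dict[str, int], n: int) -> List[str]:
    """Keep documents in at least one top cluster; rank them by p_d(q)."""
    selected = [d for d in p_d_q if n_top.get(d, 0) > 0]
    return sorted(selected, key=lambda d: p_d_q[d], reverse=True)[:n]


def bag_select(p_d_q: Dict[str, float], n_top: Dict[str, int], n: int) -> List[str]:
    """Score by p_d(q) * |Facets_q(d)|, keep the top n, then re-rank by p_d(q)."""
    scores = {d: p_d_q[d] * n_top.get(d, 0) for d in p_d_q}
    selected = [d for d in p_d_q if scores[d] > 0]
    top = sorted(selected, key=lambda d: scores[d], reverse=True)[:n]
    return sorted(top, key=lambda d: p_d_q[d], reverse=True)


# Toy data (hypothetical): d2 has a low p_d(q) but sits in three top clusters.
p_d_q = {"d1": 0.5, "d2": 0.3, "d3": 0.4, "d4": 0.45}
n_top = {"d1": 1, "d2": 3, "d4": 1}
by_set = set_select(p_d_q, n_top, 2)
by_bag = bag_select(p_d_q, n_top, 2)
```

In this toy case the multiplicity factor lets d2 displace d4 from the top-2 list, while the final order within the list is still decided by p_d(q).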

Aspect-X methods. We now turn to algorithms making more 
explicit use of clusters as smoothing mechanisms. In par- 
ticular, we study what we term "aspect-x" methods. Our 
choice of name is a reference to the work of Hofmann and 
Puzicha 0, which conceives of clusters as explanatory latent 
variables underlying the observed data. (The "x" stands for 
"extended"). In our setting, this idea translates to using 
p_c(q) as a proxy for p_d(q), where the degree of dependence
on a particular p_c(q) is based on the strength of association
between d and c. The aspect-x algorithm measures this
association by p_c(d); the uniform-aspect-x algorithm
assumes that every d ∈ c has the same degree of association
to c. In both cases, re-ranking by p_d(q) is applied.

The scoring function we use for our aspect-x algorithm can
be motivated by appealing to the probabilistic derivation of
the aspect model as follows. It is a fact that

    p(q|d) = \sum_{c} p(q|d,c)\, p(c|d).    (1)

The aspect model assumes that a query is conditionally
independent of a document given a cluster (which is a way
of using clusters to smooth individual-document statistics),
in which case p(q|d) = Σ_c p(q|c) p(c|d). If we further
assume that p(d) and p(c) are constant, we can write p(q|d) =
α Σ_c p(q|c) p(d|c), where α is a constant that doesn't affect
ranking. Our aspect-x algorithm then arises by replacing the
conditional probabilities with the corresponding language
models and only summing over the clusters in Facets_q(d).
Constraining which clusters participate in the sum to those
of relatively high rank is important: experiments indicate
that using a large number of clusters could be detrimental.
We note, however, that it appears difficult within the
strictly probabilistic framework of the original aspect model
to incorporate such a constraint: a particular cluster's rank
depends on all the other clusters, but none of the terms
in the basic aspect-model equation explicitly conditions on
them.

² Because our implementation treats clusters and their
component documents in a "fifo" manner, it deviates slightly
from the template. Let N' be the number of documents in
the m − 1 highest-ranked clusters. Then, only the N − N'
documents in the m'th cluster that are closest, in the
KL-divergence sense, to the cluster's basis are allowed into
TopDocs(N).
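
The two aspect-x scoring functions reduce to short sums over Facets_q(d). The sketch below assumes that the values p_c(q) and p_c(d) for the relevant clusters are already available as dictionaries keyed by a cluster id; the function names and inputs are hypothetical.

```python
from typing import Dict, Iterable


def aspect_x_score(facets: Iterable[str], p_c_q: Dict[str, float],
                   p_c_d: Dict[str, float]) -> float:
    """Sum of p_c(q) * p_c(d) over the clusters in Facets_q(d)."""
    return sum(p_c_q[c] * p_c_d[c] for c in facets)


def uniform_aspect_x_score(facets: Iterable[str], p_c_q: Dict[str, float]) -> float:
    """Sum of p_c(q) over Facets_q(d): every member is equally associated with c."""
    return sum(p_c_q[c] for c in facets)


# Toy values (hypothetical) for one document d with two facets.
p_c_q = {"c1": 0.2, "c2": 0.4}
p_c_d = {"c1": 0.5, "c2": 0.25}
weighted = aspect_x_score(["c1", "c2"], p_c_q, p_c_d)
uniform = uniform_aspect_x_score(["c1", "c2"], p_c_q)
```

Neither score contains p_d(q), which is why both algorithms rely on the re-ranking step to bring document-specific information back in.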

A hybrid algorithm. The selection-only algorithms emphasize
p_d(q) in scoring a document d; in contrast, the aspect-x
algorithms rely on p_c(q). We created the interpolation
algorithm to combine the advantages of these two approaches.

The algorithm can be derived by dropping the original
aspect model's conditional independence assumption (namely,
that p(q|d,c) = p(q|c)) and instead setting p(q|d,c) in
Equation 1 to λ p(q|d) + (1 − λ) p(q|c), where λ indicates the
degree of emphasis on individual-document information. If
we do so, then via some algebra we get p(q|d) = λ p(q|d) +
(1 − λ) Σ_c p(q|c) p(c|d). Finally, applying the same
assumptions as described in our discussion of the aspect-x algorithm
yields a score function that is the linear interpolation of the
score of the standard LM approach and the score of the
aspect-x algorithm. Note that no re-ranking step occurs; as
we shall see, the interpolation algorithm's incorporation of
document-specific information yields higher precision.
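
The algebra referred to above is short. The following worked step is a reconstruction, assuming only that the p(c|d) sum to one over clusters and reusing the symbol p(q|d) on both sides in the same way as the text.

```latex
\begin{align*}
p(q|d) &= \sum_{c} p(q|d,c)\, p(c|d) \\
       &= \sum_{c} \bigl[\lambda\, p(q|d) + (1-\lambda)\, p(q|c)\bigr]\, p(c|d) \\
       &= \lambda\, p(q|d) \sum_{c} p(c|d) + (1-\lambda) \sum_{c} p(q|c)\, p(c|d) \\
       &= \lambda\, p(q|d) + (1-\lambda) \sum_{c} p(q|c)\, p(c|d),
          \qquad \text{since } \sum_{c} p(c|d) = 1 .
\end{align*}
```

The first term is the document-based estimate and the second is the aspect-model term, so the result is a convex combination of the two.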

4. RELATED WORK 

Document clustering has a long history in information re- 
trieval |H I16|: in particular, approximating topics via clus- 
ters is a recurring theme |17) . Arguably the work most re- 
lated to ours by dint of employing both clustering and lan- 
guage modeling in the context of ad hoc retrieval 3 is that 
on latent-variable models, e.g., [5J |SJ 1121 of which the 
classic aspect model is one instantiation. Such work takes a 
strictly probabilistic approach to the problems we have dis- 
cussed with standard language modeling, as opposed to our 
algorithmic viewpoint. Also, a focus in the latent-variable 
work has been on sophisticated cluster induction, whereas 
we find that a very simple clustering scheme works rather 
well in practice. Interestingly, Hofmann 8 linearly interpo- 
lated his probabilistic model's score, which is based on (soft) 
clusters, with the usual cosine metric; this is quite close in 
spirit to what our interpolation algorithm does. 

Implicit corpus structure is also exploited by Lafferty and
Zhai's expanded query language model [11]. Their method
uses interleaved document-term Markov chains (which can
be thought of as tracing "paths" between related documents) 
to enhance language models built from queries. This is sim- 
ilar conceptually to our framework's use of inter-document 
similarities to enhance the performance of document lan- 
guage models, although in our work the notion of similarity 
is more explicit. 



3 See e.g., [3], |10|. and [Bj for applications of clustering in 
related areas. 



5. EXPERIMENTAL SETUP 

Data. We conducted our experiments on TREC data. We 
used titles (rather than full descriptions) as queries, result- 
ing in an average length of 2-5 terms. Some characteristics 
of our three corpora are summarized in the following table. 



corpus     # of docs   queries        previous work
AP89       84,678      1-46, 48-50    Lafferty & Zhai [11]
AP88+89    164,597     101-150        Lavrenko & Croft [13]
LA+FR      187,526     401-450

The first two data corpora, AP89 and AP88+89, were cho- 
sen because they have served as data for previous research 
on state-of-the-art algorithms somewhat related to but con- 
siderably extending the basic LM approach. We used the 
same stemming and stopword-removal policies as in those 
previous experiments; hence, we applied the Porter stemmer 
to the AP89 collection (disk one), and we ran the Krovetz 
stemmer on AP88+89 and removed both INQUERY stop- 
words pQ and length-one tokens. LA+FR (disk 5 and 4, 
respectively), which is part of the TREC-8 corpus (we used 
TREC-8 ad hoc queries) , was neither stemmed nor subjected 
to stopword removal. This corpus is more heterogeneous 
than the other two. 

Induction of base language models. Unless otherwise
specified, we use unigram Dirichlet-smoothed language models
(which were previously shown to yield good performance
for short queries [19]) in the following manner. For the
purposes of this discussion, we use the term "document" and
notation d to refer either to a true document in the corpus C
or to a query. Let f(x ∈ y) be the number of times word x
occurs in item y. For a text sequence w = w_1 w_2 ··· w_n, the
Dirichlet-smoothed language model induced from d assigns
the following probability to w:

    p_d^{Dir}(w) \stackrel{def}{=} \prod_{i=1}^{n} \frac{f(w_i \in d) + \mu \cdot p_C^{ML}(w_i)}{|d| + \mu},

where |d| is the number of word occurrences in d, the free
parameter μ controls the degree to which the document's
statistics are altered by the overall corpus statistics, and
"ML" indicates the maximum-likelihood estimate.
Then, for any two documents d and d', we set p_d(d') to

    \exp\left(-D\left(p_{d'}^{ML}(\cdot) \,\|\, p_d^{Dir}(\cdot)\right)\right)

(normalizing when appropriate), where D is the Kullback-Leibler
divergence. This formulation is actually equivalent
to a log-likelihood criterion under certain assumptions,
but in practice is less sensitive than p_d^{Dir}(d') to variations
in the length of d'.

For a given cluster c, the corresponding language model
p_c(·) is induced by concatenating c's component documents
and then applying the document-LM induction method to
the new "document".
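
The induction procedure can be illustrated on toy data. The sketch below is not the Lemur-based implementation used in the experiments; the function names, the token-list representation, and the toy value of μ are assumptions, and every scored word is assumed to occur somewhere in the corpus so that the smoothed probability is non-zero.

```python
import math
from collections import Counter
from typing import Dict, List


def ml_model(tokens: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model of a token sequence."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def dirichlet_prob(word: str, doc: List[str], corpus_ml: Dict[str, float], mu: float) -> float:
    """(f(word in doc) + mu * p_C^ML(word)) / (|doc| + mu)."""
    return (doc.count(word) + mu * corpus_ml.get(word, 0.0)) / (len(doc) + mu)


def kl_score(text: List[str], doc: List[str], corpus_ml: Dict[str, float], mu: float) -> float:
    """exp(-D(p_text^ML || p_doc^Dir)), the score of `text` under `doc`'s model."""
    p_text = ml_model(text)
    div = sum(p * math.log(p / dirichlet_prob(w, doc, corpus_ml, mu))
              for w, p in p_text.items())
    return math.exp(-div)


# Toy corpus (hypothetical); mu = 1.0 is far smaller than realistic settings.
d1, d2 = ["a", "a", "b"], ["b", "c"]
corpus_ml = ml_model(d1 + d2)
cluster_tokens = d1 + d2  # a cluster model is induced from the concatenation
score_d1 = kl_score(["a"], d1, corpus_ml, 1.0)
score_d2 = kl_score(["a"], d2, corpus_ml, 1.0)
```

For a one-word text the score reduces to the smoothed probability of that word, so here the query "a" scores higher under d1, which contains it, than under d2, which is reached only through the corpus term.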

Reference comparisons. While one of our goals is to demon- 
strate that incorporating corpus structure as in our retrieval 
framework can provide improvements over the performance 
of the standard LM algorithm, we also wish to determine
whether our algorithms are competitive with state-of-the-art 
language-modeling-based algorithms. One natural choice for 





            Baseline: LM  basis-S   set-S     bag-S     uniform   aspect-x  interp.   Pseudo-feedback Markov chains
Avg. Prec.  21.03%        22.1%*    22.45%*   22.3%*    21.7%     22.6%*    24.9%*    23.2%
Prec. at 0  57.4%         58.5%     58.5%     58.1%     57.2%     58.2%     55.8%     53.4%
Recall      48.67%        54.86%    56.15%    62.77%*   57.31%    62.16%*   63.62%*   61.91%

Table 2: AP89 results (3261 relevant documents). Cluster size k = 40; interpolation parameter λ = 0.4.





            Baseline: LM  basis-S   set-S     bag-S     uniform   aspect-x  interp.   Relevance model
Avg. Prec.  24.37%        26.58%*   28.11%*   26.65%*   24.92%    27.5%*    31.28%*   26.17%
Prec. at 0  65.52%        65.77%    65.32%    65.17%    67.5%     65.9%     65.82%    61.61%
Recall      66.53%        71.86%    76.48%*   79.23%*   69.05%    78.9%*    80.83%*   77.69%

Table 3: AP88+89 results (4805 relevant documents). Cluster size k = 40; interpolation parameter λ = 0.6.



comparison is Lafferty and Zhai's pseudo-feedback Markov
chains algorithm [11], which extends the expanded query
language model described above by forcing the chains to
pass through top-ranked documents, as determined using
the standard LM approach. Another obvious candidate is
Lavrenko and Croft's relevance model [13], which was the first
method to explicitly incorporate relevance into the language- 
modeling framework, and which demonstrated excellent per- 
formance. Note that both algorithms, in contrast to our 
framework, depend on pseudo-feedback mechanisms to cope 
with the lack of true user feedback. 

Implementation. We used the Lemur toolkit [Tlj to run 
our experiments. Our implementations of the baseline used 
optimized smoothing-parameter settings with respect to average
non-interpolated precision⁴, computed via line search.
For our novel algorithms, we optimized the cluster-size
parameter k and the interpolation algorithm's interpolation
parameter λ, but the other parameters were set to default
values suggested in the previous literature [19]; thus, the
baseline algorithm was given an extra advantage.

Rather than re-implement the pseudo-feedback Markov- 
chain and relevance-model algorithms described above, we 
report results presented in the previous literature I13|. 
We do realize that minor differences in performance could 
stem from specific implementation issues, but as stated above, 
our goal was to test the competitiveness of our algorithms' 
performance with respect to that of other prominent algo- 
rithms, not to prove our algorithms' superiority. 

6. EXPERIMENTAL RESULTS 

For our evaluation measures, we used average non-interpolated
precision, interpolated precision at 0, and recall, all
for N = 1000 selected documents. Our main experimental
results are given by Tables 2, 3, and 4 and Figure 2.
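
For readers unfamiliar with these measures, the sketch below computes them for a single query using the usual TREC-style definitions. It is an assumption that the experiments follow exactly these definitions; the function name and inputs are hypothetical.

```python
from typing import List, Set, Tuple


def evaluate(ranked: List[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Return (average precision, interpolated precision at recall 0, recall)."""
    hits = 0
    precisions = []  # precision measured at the rank of each retrieved relevant doc
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / rank)
    if not relevant:
        return 0.0, 0.0, 0.0
    # Non-interpolated average precision: unretrieved relevant docs contribute 0.
    avg_prec = sum(precisions) / len(relevant)
    # Interpolated precision at recall 0: the best precision at any recall level.
    prec_at_0 = max(precisions, default=0.0)
    recall = hits / len(relevant)
    return avg_prec, prec_at_0, recall


# Toy ranking (hypothetical): two of the three relevant documents are retrieved.
avg_prec, prec_at_0, recall = evaluate(["d1", "d2", "d3", "d4"], {"d2", "d4", "d9"})
```

Reordering a fixed retrieved list changes the ranks at which relevant documents appear, and hence average precision, but leaves recall untouched.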

In the tables, for each evaluation metric, the strongest 
performance is boldfaced and all results above the baseline 
(LM) are italicized. Also, the Wilcoxon two-sided test was
employed with significance threshold p = 0.05; all statistically
significant performance improvements and degradations
for our algorithms relative to the baseline are marked
with a star (*).

Clearly, at the indicated settings (given in the captions),
even at worst our algorithms are always competitive with
the baseline LM approach, and with occasional exceptions
(mostly for precision at 0) generally do better. We also
observe that the aspect-x and interpolation algorithms are
competitive with the pseudo-feedback Markov-chains algorithm
(see Table 2) and the relevance-model algorithm (see
Table 3) with respect to all performance measures.

⁴ Optimization with respect to recall yielded results which
were statistically indistinguishable with respect to each of
our performance metrics.

Figure 2 shows 11-point precision/recall curves for our
algorithms and the baseline. In all three corpora, the
interpolation algorithm does best overall. On AP89 and AP88+89,
our cluster-based algorithms on the whole generally perform
demonstrably better than the baseline. In LA+FR, however,
the new algorithms, with the exception of the interpolation
algorithm, seem difficult to distinguish from LM, as is borne
out by the relative lack of statistical-significance indications
in Table 4.

The fact that the aspect-x algorithm was usually superior
to uniform-aspect-x indicates that incorporating
within-cluster structure, as represented by p_c(d), is important.

Finally, the generally high performance of our aspect-x 
and interpolation algorithms seems to support our claims 
as to the importance of using corpus-structural information 
in the particular ways we have suggested: specifically, in 
these two algorithms, clusters play both a selection and a 
smoothing role, and both document-specific information and 
intra-cluster structure are incorporated as well. 

In what follows, we discuss the results of further experi- 
mental studies. For space reasons, we present only a subset 
of the performance figures for a selection of corpora. 

Parameter selection. The cluster-size parameter k does 
have a noticeable impact on performance. A series of prelim- 
inary experiments (whose results are omitted due to space 
restrictions) indicate that small values of k (e.g., 5 or 10) 
yield better results than the baseline LM for all but the 
uniform-aspect-x method, demonstrating the usefulness of 
even tiny document clusters. However, increasing k to 40 
resulted in superior performance on the AP89 and AP88+89 
datasets, which suggests that the re-rank step of our algo- 
rithm template can compensate to a degree for the extra 
irrelevant documents that large clusters may bring into con- 
sideration. 

            Baseline: LM  basis-S   set-S     bag-S     uniform   aspect-x  interp.
Avg. Prec.  22.16%        21.92%    22.52%    22%       21.73%    22.45%    23.88%*
Prec. at 0  57.37%        57.91%    57.89%    58.16%    57.28%    58.25%    57.81%
Recall      48.31%        55.64%    58.09%    53.34%    58.09%    63.7%*    61.18%*

Table 4: Results for LA+FR (1391 relevant documents). Cluster size k = 10; interpolation parameter λ = 0.8.

[Figure 2 plots not reproduced; surviving panel titles: AP88+89, LA+FR.]

Figure 2: 11-point precision/recall curves. For AP89 and AP88+89, k = 40; for LA+FR, k = 10.

We must also choose m, the number of clusters to be
retrieved, recalling that we wish to return a fixed number
N = 1000 of documents. In the experimental results
reported above, two different schemes were used. For the
algorithms using clusters solely for selection, we set m to
either 1000 or the minimum value needed for there to be 1000
documents receiving a non-zero score.⁵ For the remaining
algorithms (aspect-x, uniform-aspect-x, and interpolation),
we set m = 10000. The former group of algorithms were
more sensitive to choice of m than the latter, where as long
as m did not exceed 10000, satisfactory improvements with
respect to the baseline algorithms were observed. However,
drawing upon more clusters than this — which in a sense is 
what the classic aspect model [5J does — was clearly detri- 
mental for some of the data corpora. 

An important regard in which the interpolation algorithm
differs from the other methods we have introduced is in its
inclusion of an additional free parameter λ, representing the
degree of dependence on p_d(q) relative to the aspect-x
algorithm. Figure 3 plots the "trajectories" of the interpolation
algorithm through performance space as λ is increased.
This figure makes visually clear the interplay between cluster
and document information: small values of λ (emphasizing
clusters) result in better recall but relatively poor precision,
whereas large values of λ (emphasizing documents) improve
precision at the expense of recall. The performance of "average"
values (around 0.6) shows that integrating document- and
cluster-level information provides better performance than either
can produce alone.

We note that the aspect-x algorithm can be viewed as a
version of the interpolation algorithm in which λ = 0 and
re-ranking is added to improve average precision. In return
for some performance degradation relative to the interpolation
algorithm, it offers the advantage of having one fewer
parameter to tune, and is fairly robust to m's value as well.
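
The role of λ and its two limiting cases can be checked directly on the interpolation score from Table 1. The sketch below uses hypothetical names and toy values, and assumes p_d(q), p_c(q) and p_c(d) are precomputed.

```python
from typing import Dict, Iterable


def interpolation_score(lam: float, p_d_q: float, facets: Iterable[str],
                        p_c_q: Dict[str, float], p_c_d: Dict[str, float]) -> float:
    """lam * p_d(q) + (1 - lam) * sum over facets of p_c(q) * p_c(d)."""
    aspect = sum(p_c_q[c] * p_c_d[c] for c in facets)
    return lam * p_d_q + (1.0 - lam) * aspect


# Toy values (hypothetical) for one document with two facets.
p_c_q = {"c1": 0.2, "c2": 0.4}
p_c_d = {"c1": 0.5, "c2": 0.25}
facets = ["c1", "c2"]
as_lm = interpolation_score(1.0, 0.3, facets, p_c_q, p_c_d)      # baseline LM score
as_aspect = interpolation_score(0.0, 0.3, facets, p_c_q, p_c_d)  # aspect-x score
mixed = interpolation_score(0.6, 0.3, facets, p_c_q, p_c_d)
```

Intermediate values of λ move the score continuously between the two extremes, which is the trade-off the trajectories in Figure 3 trace out.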

The re-rank Step. How important is the re-ranking step, 
in which the top-ranked documents are re-scored by their 
document-specific language models, to producing good pre- 
cision? We ran several experiments to explore this issue. 
First, we observed considerable degradation in average 

            AP89             AP88+89          LA+FR
re-rank?    yes     no       yes     no       yes     no
Avg. Prec.  22.6%   19.8%    27.5%   26.95%   22.45%  16.01%
Prec. at 0  58.2%   46.98%   65.9%   65.21%   58.25%  46.92%

Table 5: Effect of re-rank step on aspect-x precision.
For AP89 and AP88+89, k = 40; for LA+FR, k = 10.

⁵ In the bag-select algorithm we chose m = 1000, although
lower values would have sufficed.



precision if we removed the p_d(q) term from the score
functions of the basis-select and set-select algorithms, for which
re-ranking is redundant. Note that this version of the
basis-select algorithm corresponds to applying the basic LM
approach to the "document" Cohort(d) rather than d itself,
and so can be thought of as a smoothing method wherein
the document language model is created by backing off
completely to a cluster language model.

Next, we examined the role of the optional re-ranking step
in the algorithms that explicitly incorporate it. When the
aspect-x and uniform-aspect-x algorithms (the two cases
in which the scoring function does not incorporate p_d(q))
were run without the optional re-ranking phase, low average
precision and precision at zero resulted, implying that reliance
on p_c(·) alone suffers from over-regularization; the results for
the aspect-x algorithm are shown in Table 5. Furthermore,
re-ranking is also required to achieve reasonable precision for
the bag-select algorithm, even though its scoring function
incorporates p_d(q): when re-ranking is not applied, average
precision suffers when clusters are small.
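
The re-ranking step discussed above can be sketched as follows. This is only an illustration under stated assumptions: the cutoff `top_n`, the choice to leave lower-ranked documents in their initial order, and all names and scores are hypothetical.

```python
def rerank_top(initial_scores, doc_lm_scores, top_n):
    """Re-order the top_n documents of an initial ranking by their
    document-specific LM score; the rest keep their initial order."""
    initial = sorted(initial_scores, key=initial_scores.get, reverse=True)
    head, tail = initial[:top_n], initial[top_n:]
    head = sorted(head, key=lambda d: doc_lm_scores[d], reverse=True)
    return head + tail


# Toy, made-up scores: a cluster-based initial score and p_d(q).
cluster_based = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}
doc_lm = {"d1": 0.2, "d2": 0.5, "d3": 0.4, "d4": 0.9}

reranked = rerank_top(cluster_based, doc_lm, top_n=3)
print(reranked)
```

In this sketch, d4 has the highest document-based score but is not promoted, because re-ranking only reorders documents that the cluster-based score already placed near the top.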

In the case of the interpolation algorithm, however, the
additional re-rank phase is not needed as long as the
interpolation weight λ for the document-based language model is
large enough. This can be seen in Figure 4, where the
difference between average precision without and with re-rank at
different values of λ is reported; observe that for λ > 0.4,
re-ranking degrades performance.

These results suggest that (1) for best results, it is impor- 
tant to strike the right balance between document-specific 
and inter-document information, and (2) for some algorithms, 
re-ranking creates this balance, but in others it can upset it.

Figure 3: Interpolation algorithm's recall vs. average precision as λ grows (increments of 0.1 until 0.9, then
0.925, 0.95, 0.975, 0.98, 0.99). Recall that λ = 1 would yield the baseline language-model scoring function. Similar
patterns were observed on the LA+FR corpus; we omit the results for clarity.

Figure 4: Effect of re-rank step on average precision
for the interpolation algorithm as λ varies. For AP89
and AP88+89, k=40; for LA+FR, k=10.



Smoothing. The sensitivity of the LM approach to choice of
smoothing technique and smoothing parameters has prompted
a great deal of research [19, 17, 15]. However, we found
that for our algorithms, simply setting the Dirichlet smoothing
parameter μ to a suggested value of 2000 [19] (or, as
it turned out, randomly chosen values within the
neighborhood of 2000) outperformed the μ-optimized baseline.
Moreover, experiments with Jelinek-Mercer and absolute
discounting, two other well-known single-parameter smoothing
methods [19], yielded the same outcome of relative
insensitivity to choice of parameter value for the underlying
smoothing method employed.
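
For concreteness, the two smoothing methods named above can be sketched in their standard textbook forms. The helper names, the toy corpus, and the parameter name `beta` (chosen to avoid confusion with the interpolation weight λ) are assumptions made for illustration only.

```python
import math
from collections import Counter


def dirichlet(term, counts, doc_len, corpus_prob, mu=2000.0):
    """Dirichlet-smoothed p(term | d): (tf + mu * p(term | corpus)) / (|d| + mu)."""
    return (counts.get(term, 0) + mu * corpus_prob[term]) / (doc_len + mu)


def jelinek_mercer(term, counts, doc_len, corpus_prob, beta=0.5):
    """Jelinek-Mercer p(term | d): fixed-weight mix of document and corpus estimates."""
    ml = counts.get(term, 0) / doc_len
    return (1.0 - beta) * ml + beta * corpus_prob[term]


def query_log_likelihood(query, doc_tokens, corpus_prob, mu=2000.0):
    """log p_d(q) under a Dirichlet-smoothed unigram model of the document."""
    counts = Counter(doc_tokens)
    return sum(
        math.log(dirichlet(t, counts, len(doc_tokens), corpus_prob, mu))
        for t in query
    )


# Toy corpus (made up for illustration).
docs = [["a", "b", "a"], ["b", "c"]]
all_tokens = [t for d in docs for t in d]
corpus_prob = {t: n / len(all_tokens) for t, n in Counter(all_tokens).items()}

counts0 = Counter(docs[0])
total = sum(dirichlet(t, counts0, len(docs[0]), corpus_prob) for t in corpus_prob)
print(total)  # smoothed probabilities over the vocabulary should sum to 1
print(query_log_likelihood(["a", "c"], docs[0], corpus_prob))
```

Both estimators assign nonzero probability to query terms that a document lacks, which is what makes the query likelihood usable as a ranking score.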

Feature selection. Another interesting observation is that
effective incorporation of cluster information somewhat
obviates the need for feature selection. In particular, Table 6
shows one case where using the aspect-x and interpolation
algorithms without a stemmer or stop-word list outperforms
the baseline with access to the Porter stemmer with respect
to average precision and recall. On the other hand,
stemming led to degradation of precision at zero. Results for
cluster sizes other than 40 and different corpora were
consistent with these findings.

Is it all due to language modeling? Throughout this
paper, we have used language models as our information
representation. An interesting question is whether it is the
representation (e.g., p_c(·)) or the source of this
representation (e.g., c itself) that matters most. We therefore explored
the effect of using an alternative representation.
Specifically, both the queries and the documents were represented
using log-based tf.idf, with the inner product as the similarity
measure. As before, clusters were treated as large
documents formed by concatenating their contents. Altering our
selection algorithms (basis-select, set-select, and bag-select)
in this way led to improved performance with respect to the
basic tf.idf retrieval algorithm, as shown in Table 7. On
the other hand, these algorithms did not do as well as their
original, LM-based counterparts. We thus see that our
algorithmic framework can boost performance for other
information representations over the structure-blind alternative,
but language models do seem to have advantages, at least
in comparison to tf.idf.
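
The alternative representation described above can be sketched as follows. The exact weighting is an assumption: the sketch uses (1 + log tf) · log(N / df) as one common "log-based" tf.idf variant, with document frequencies computed over individual documents. The names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokens, df, n_docs):
    """Sparse log-tf.idf vector: (1 + log tf) * log(N / df) per term (assumed variant)."""
    return {
        t: (1.0 + math.log(tf)) * math.log(n_docs / df[t])
        for t, tf in Counter(tokens).items()
        if t in df
    }


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy documents (made up for illustration).
docs = {
    "d1": ["cat", "sat", "mat"],
    "d2": ["cat", "cat", "hat"],
    "d3": ["dog", "sat", "log"],
}
df = Counter(t for toks in docs.values() for t in set(toks))
n_docs = len(docs)

query = tfidf(["cat", "hat"], df, n_docs)
doc_scores = {
    d: inner_product(query, tfidf(toks, df, n_docs)) for d, toks in docs.items()
}

# A cluster is treated as one large document: concatenate its members.
cluster_tokens = docs["d1"] + docs["d2"]
cluster_score = inner_product(query, tfidf(cluster_tokens, df, n_docs))
print(doc_scores, cluster_score)
```

The cluster is scored with the same function as a single document, so the selection algorithms can be applied unchanged once the language-model scores are swapped for these inner products.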

7. CONCLUSIONS 

In summary, we have proposed a general framework that 
enables the development of a variety of algorithms for in- 
tegrating corpus similarity structure, modeled via clusters, 
and document-specific information. Although our proposal 
is motivated by the recent language-modeling approach to 
information retrieval, and the specific algorithms presented 
here do use language models for representation purposes to 
good effect, we observed that the framework also can be 
used with basic classic IR techniques such as tf.idf. 

                            Avg. Prec.   Prec. at 0   Recall
S-Baseline                  21.03%       57.4%        48.67%
U-Baseline                  19.56%       60.1%        44.99%
S-aspect-x                  22.6%*       58.2%        62.16%*
U-aspect-x                  21.51%*      60.9%        60.47%*
S-interpolation (λ = 0.4)   24.9%*       55.8%        63.62%*
U-interpolation (λ = 0.6)   24.08%*      58.9%        60.66%*

Table 6: Stemming comparison on AP89. S-: stemmed version; U-: un-stemmed version. Cluster size k = 40.
Significant differences are reported with respect to the corresponding baseline.

                 Avg. Prec.   Prec. at 0   Recall
tf.idf version
  Baseline       16.43%       46.66%       47.45%
  basis-select   16.67%       46.94%       55.07%
  set-select     17.18%*      47.29%       57.58%
  bag-select     16.68%*      46.92%       51.98%
LM version
  Baseline       22.16%       57.37%       48.31%
  basis-select   21.92%       57.91%       55.64%
  set-select     22.52%       57.89%       58.09%
  bag-select     22%          58.16%       53.34%

Table 7: Simple similarity metric based on tf.idf vs. LM-based similarity on LA+FR. Cluster size k = 10.

An interesting direction for future work is to explore the
effect of using alternative clustering algorithms. We would
also like to study the role that overlapping plays in our
framework: is most of the performance gain due to the
(high) degree of overlap in our clusters or to the way
structure and individual-document information are integrated?
Another interesting direction is to examine whether other
algorithms, such as the LM-based pseudo-feedback methods
we used for reference comparisons [11, 13], can benefit if we
replace the basic LM retrieval algorithm they employ with
one of ours.

Most importantly, we would like to develop a principled
probabilistic interpretation of the framework we have
proposed. We have done some preliminary work based on
considering the factorization p(q|d) = Σ_c p(q|d, c) p(c|d); some
of the components of our scoring functions can be
considered to be (very rough) approximations of the terms in this
factorization. Creating a rigorous probabilistic foundation
for the work described here is one of our main future goals.
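
For reference, the factorization follows from marginalizing over clusters and applying the chain rule. The last line below adds a conditional-independence assumption of the kind made by aspect models; it is shown only as an illustration and is not a claim about which approximations our scoring functions correspond to.

```latex
\begin{align*}
p(q \mid d) &= \sum_{c} p(q, c \mid d)
  && \text{(marginalize over clusters } c\text{)} \\
            &= \sum_{c} p(q \mid d, c)\, p(c \mid d)
  && \text{(chain rule)} \\
            &\approx \sum_{c} p(q \mid c)\, p(c \mid d)
  && \text{(if } q \text{ is assumed independent of } d \text{ given } c\text{)}
\end{align*}
```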